A Noob's guide to Text Classification Using NLTK, Scikit and Gensim

The summer of 2015 was very productive! I got an opportunity to work with a startup on a text classification problem. We were dealing with a very large HTML corpus, which made it all the more challenging to load, process and make sense of the data. This tutorial will (hopefully) present a more verbose take on text classification and discuss a few libraries, techniques and hacks that could come in handy while working on it.

Here is a bit of background about me before we dive right in: "I am a Computer Science grad student at the University at Buffalo, SUNY, New York. I hold a bachelor's degree in Telecommunication Engineering. I work with Dr. Kris Schindler on developing an augmentative communication system for the speech-impaired using Brain Computer Interfaces and Natural Language Processing (NLP)."

Prerequisites

The tutorial assumes that you have some basic working knowledge of programming in Python. If you have never programmed in Python before, pause this tutorial for a second and check out A Byte of Python. The ebook serves as a tutorial or guide to the Python language for a beginner audience. I would also highly recommend Programming Foundations with Python by Udacity.
This tutorial also assumes that you are familiar with the basics of machine learning, especially classification algorithms such as LogisticRegression, SGDClassifier and Multinomial Naive Bayes. Then again, I'll provide you with resources that'll help you understand the theory wherever required. If you are looking for a good ML tutorial online, I would highly recommend taking the Introduction to Machine Learning course by Udacity and Introduction to Machine Learning by Andrew Ng.
If you are new to IPython Notebook, head here

Let's get started, shall we?

Installation Instructions

  1. Download Anaconda from Here.
    Anaconda comes prepackaged with all the libraries that we will need in this particular tutorial. We will be using Gensim, NLTK and Scikit-learn in particular.
    Windows Installation Instructions
    Download and run the Windows installer from the link. Voila!!
    Linux Installation Instructions
    Download from the link provided and in your terminal window type the following, replacing the file path and name with the path and name of your downloaded install file. Follow the prompts on the installer screens. If unsure about any setting, simply accept the defaults, as they can all be changed later: bash ~/Downloads/Anaconda-2.3.0-Linux-x86_64.sh
    Mac OSX Installation Instructions
    Download and install the setup file from the link. NOTE: You may see a screen that says “You cannot install Anaconda in this location. The Anaconda installer does not allow its software to be installed here.” To fix this, click the “Install for me only” button with the house icon and continue the installation.



FYI: I am running a 64 bit Ubuntu 14.04 LTS with Intel Core i5 CPU and 8 Gigs of RAM.

About the Dataset

We will be using the Reuters dataset that comes bundled with the nltk package. You can download the dataset from the following link.
If you wish to download all the datasets that come bundled with nltk, run the code snippet below. You'll then be prompted by the NLTK downloader. Choose and download all the packages. It might take some time for all the corpora to be downloaded.


In [ ]:
import nltk
nltk.download()

The First Step

The first step in any machine learning problem is to go through the dataset and understand its structure. This will give us better clarity when we start modelling the input data. We begin by extracting the dataset and notice that it contains a training and a test folder. We observe that there are about 90 categories in the Reuters dataset, with about 7769 documents in the training set and 3019 documents in the test set. The distribution of categories in the corpus is highly skewed, with 36.7% of the documents in the most common category, and only 0.0185% (2 documents) in each of the five least common categories. In fact, the original data source is even more skewed---in creating the corpus, any categories that did not contain at least one document in the training set and one document in the test set were removed from the corpus by its original creator. The README file should give you more information about the dataset. The cats.txt file contains the mapping of each input filename to its respective category. There is also a stopwords file that contains a list of stop words. We will discuss stop words further in the coming sections.
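If you want to see what this mapping looks like, the quick peek below (an extra cell, not part of the original walkthrough) prints the first few lines of cats.txt; adjust the path to wherever you extracted the corpus.


In [ ]:
# print the first few lines of cats.txt to see the filename -> category mapping
# NOTE: adjust this relative path to your own extracted corpus location
with open('reuters/cats.txt') as catsfile:
    for i, line in enumerate(catsfile):
        print line.strip()
        if i == 4:
            break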
The first thing to do is to load the dataset. Though nltk has a CategorizedPlaintextCorpusReader to do this, I am more inclined towards using the load_files function provided with scikit-learn to load the data. This is mainly because it is much easier to handle data in scikit-learn, as the load_files function returns a data Bunch. A Bunch in Python lets you access Python dicts as objects.

Loading the dataset using Scikit-learn

Scikit-learn is an amazing library for quickly coding up your machine learning project. It provides some very easy and useful functions to do pretty much everything from classification and regression to clustering. We are going to dive right into scikit-learn and use the sklearn.datasets.load_files function to do this. Please note that load_files expects a certain directory structure for it to work: it loads the text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder structure such as the following:

container_folder/
    category_1_folder/
        file_1.txt 
        file_2.txt ... 
        file_42.txt
    category_2_folder/
        file_43.txt 
        file_44.txt ...

The folder names (category_1_folder, category_2_folder etc.) are used as target labels. The individual file names are not very important.

Check out the following documentation page for a detailed explanation of the load_files function.


In [ ]:
from sklearn.datasets import load_files

Now, we can't directly load the Reuters dataset as it is. We need to segregate the training examples (documents) into their respective category folders for scikit-learn to load them. Python's os module to the rescue!!


In [ ]:
import os
import os.path
from os.path import join
import shutil 
# root path of the dataset
rootpath = '/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/reuters/'
# the cats.txt file that contains the mapping between the file and their respective categories
catpath= '/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/reuters/cats.txt'
# path were the newly setup dataset will reside
newpath= '/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/'
# using with to open a file will automatically handle the closing of file handler
with open(catpath) as catsfile:
    for line in catsfile:
        key = line.split()[0]  # path and filename
        value= line.split()[1] # category
        # create directory if it does not exists
        if not os.path.exists(join(newpath,key.split('/')[0],value)): os.makedirs(join(newpath,key.split('/')[0],value))         
        #shutil.copy2(source,destination) lets you copy the files from the source directory to destination
        shutil.copy2(join(rootpath,key), join(newpath,key.split('/')[0],value))
        
print "DONE"

VOILA!

Now we have the input data in the required structure. Let us go ahead and load it with sklearn.datasets.load_files.


In [ ]:
from sklearn.datasets import load_files
training_data = load_files('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/training/')
print "Loaded " + str(len(training_data.filenames)) + " Training Documents "

As discussed earlier, the load_files function returns a data bunch which consists of {target_names, data, target, DESCR, filenames}. We can access a particular file from the training set as follows:


In [ ]:
# category of first document in the bunch
print "TARGET NAME : " + training_data.target_names[training_data.target[0]]
# data of the first document in the bunch
print "DATA : " + training_data.data[0][:500]
# Target value of the first document in the bunch
print "TARGET : " + str(training_data.target[0])
# filename of the first document in the bunch
print "FILENAME: " + training_data.filenames[0]

Feature Extraction

A key part of building a machine learning system is feature extraction. Often, 80% of your time and effort in a machine learning project is spent on finding techniques that give you good features to work with. Even an effective algorithm becomes useless with bad features. The most popular technique used in text classification is the Bag of Words representation, which converts the input text data into a numeric representation. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat"
From these two sentences, our vocabulary is as follows:
{ the, cat, sat, on, hat, dog, ate, and }
" To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:
{ the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Similarly, the features for Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1}


*Example and explanation taken from Kaggle
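To make this concrete, here is a small extra cell (my addition, not from the Kaggle example) that reproduces the two-sentence example with scikit-learn's CountVectorizer; note that the vocabulary comes out lowercased and in alphabetical order rather than in the order listed above.


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

toy_sentences = ["The cat sat on the hat",
                 "The dog ate the cat and the hat"]

toy_vect = CountVectorizer(analyzer="word")
toy_counts = toy_vect.fit_transform(toy_sentences)

# the learned vocabulary (alphabetical, lowercased by default)
print toy_vect.get_feature_names()
# one row per sentence, one column per vocabulary word
print toy_counts.toarray()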

Let us fire up sklearn.feature_extraction.text and start building the bag of words representation for the training data.
While constructing the bag of words representation, we come across words such as "the", "a", "am", ... which do not add much meaning. These words are filtered out before processing the input text, primarily to reduce the dimensionality of the data. We begin by reading the stopwords file from the Reuters corpus into a list and passing that (as a set) as an argument to CountVectorizer.


In [ ]:
stopwords_list = []
with open('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/reuters/stopwords') as f:
    for line in f:
        stopwords_list.append(line.strip())
        
print "Stop Words List :"
print stopwords_list[:10] 
print "...."

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
import datetime,re


print ' [process  started: ' + str(datetime.datetime.now()) + ']'
# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.  
count_vect = CountVectorizer(analyzer = "word", stop_words= set(stopwords_list))

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
X_train_count= count_vect.fit_transform(training_data.data)
print ' [process  ended: ' + str(datetime.datetime.now()) + ']'
print "Created a Sparse Matrix with " + str(X_train_count.shape[0]) + " Documents and "+ str(X_train_count.shape[1]) + " Features"

CountVectorizer builds a vocabulary that maps every word in the input corpus to a unique integer index, and creates a sparse matrix of token counts in which each column corresponds to one of those indices. For instance, the word oil is assigned the index 16654:


In [ ]:
print count_vect.vocabulary_.get(u'oil')

From Occurrences to Frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics. To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies. Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus. This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
Refer to the following Link for a detailed explanation on TFIDF
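As a quick illustration (an extra cell, not part of the original flow), applying TfidfTransformer to the toy count matrix from the two-sentence example shows the effect: words shared by both sentences such as "the", "cat" and "hat" get lower weights than words unique to one sentence. The exact values depend on the transformer's smoothing and normalization defaults.


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_sentences = ["The cat sat on the hat",
                 "The dog ate the cat and the hat"]
toy_counts = CountVectorizer().fit_transform(toy_sentences)

# words appearing in both sentences are down-weighted relative to words
# unique to one sentence; rows are L2-normalized by default
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)
print toy_tfidf.toarray()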


In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
print '[process started: ' + str(datetime.datetime.now()) + ']'
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_count)
print '[process ended: ' + str(datetime.datetime.now()) + ']'
print X_train_tfidf.shape

Naive Bayes Classifier


I am not going to bore you with the details here. Let us just say that Naive Bayes is a very solid yet simple algorithm when it comes to classifying text, and by default it should be the very first algorithm that you try out. Scikit-learn has a very good implementation of Naive Bayes and we will be using it to classify our text. For the curious souls, there is more about the NB classifier here.


In [ ]:
from sklearn.naive_bayes import MultinomialNB
print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
# you fit the NB model
clf = MultinomialNB().fit(X_train_tfidf, training_data.target)
print ' [Classification ended: ' + str(datetime.datetime.now()) + ']'

Testing the Performance of the Classifier

We now test the performance of the classifier in terms of accuracy. We begin by importing the test documents using load_files, the same way we did for the training documents. We then apply the CountVectorizer and TF-IDF transformations to each test document. The predict function outputs the predicted label for the test document.


In [ ]:
from __future__ import division
import numpy as np
test_data= load_files('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/test/',shuffle=True, encoding='ISO-8859-2')
count=0
print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
for i in range(0,len(test_data.filenames)):
    docs_test = [test_data.data[i]]
    # Apply the count vectorizer we used to fit the training data on the test data
    doc_test_counts = count_vect.transform(docs_test)
    # apply the tfidf transformation
    doc_test_tfidf = tfidf_transformer.transform(doc_test_counts)
    # predict returns one label per input document; take the single entry
    predicted = clf.predict(doc_test_tfidf)[0]
    # Predicted label based on the classifier prediction above
    predicted_label = training_data.target_names[predicted]
    # True label of test document
    true_label=test_data.target_names[test_data.target[i]]
    # calculate the accuracy
    if predicted_label==true_label:
        count+=1
print ' [Classification Ended: ' + str(datetime.datetime.now()) + ']'
print "ACCURACY : " + str(count/len(test_data.filenames))

Did you just say 64.392182842 % Accuracy?


The good news is we could do much better than this!!!!

The poor performance of Naive Bayes is a clear indication that the features we chose could be better. So in this section we will spend some time tweaking the features to see if we can improve the accuracy of the classifier. Fortunately, CountVectorizer provides several arguments that we can tweak. Let us dive right in.


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
print count_vect

We can see the default arguments of CountVectorizer by printing the CountVectorizer instance as shown above. The first and foremost parameter that we must pay attention to is lowercase.
This is a very important argument for dimensionality reduction. Setting lowercase=False will treat the words Cat and cat differently even though they mean the same thing. Thus it is advisable to either lowercase the input data or set lowercase=True in CountVectorizer. CountVectorizer sets this parameter to True by default, so we may not have to set it explicitly while initializing.
The max_df and min_df arguments help filter out the most common and the rarest words respectively. We avoid using a word that is present in almost every document, as it may not be very indicative of the class the text belongs to. The same argument applies to words that appear very rarely in the input corpus. Filtering them out leads to a better set of features, which could help us better categorize a text.
min_df can take integer or float values. Setting min_df=3 tells CountVectorizer to ignore words that appear in fewer than 3 documents. If a float, the parameter represents a proportion of documents; if an integer, an absolute count.
max_df ignores terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, an absolute count.
An n-gram is a contiguous sequence of n items from a given text. An n-gram representation can make more sense when we are analyzing a large chunk of text, as it picks up collocations. Using bigram or trigram features may give us features such as "United States" and "US Vice President", which could improve the performance of a classifier significantly compared to using unigram tokens such as ["United", "States", "US", "Vice", "President"]. We will be using unigram, bigram and trigram tokens by setting the ngram_range parameter to the tuple (1,3).
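Before applying this to the full corpus, here is a tiny illustration (an extra cell I've added) of what ngram_range=(1,3) produces for a short phrase; the feature list now contains unigrams, bigrams and trigrams.


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vect = CountVectorizer(analyzer="word", ngram_range=(1, 3))
ngram_vect.fit(["The US Vice President visits the United States"])

# features include unigrams, bigrams and trigrams,
# e.g. "united states" and "us vice president"
print ngram_vect.get_feature_names()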


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
import datetime,re

print ' [process  started: ' + str(datetime.datetime.now()) + ']'
# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.  
count_vect = CountVectorizer(analyzer = "word", stop_words= set(stopwords_list), min_df=3, max_df=0.5, lowercase=True,
                            ngram_range=(1,3))
# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
X_train_count= count_vect.fit_transform(training_data.data)
print ' [process  ended: ' + str(datetime.datetime.now()) + ']'
print "Created a Sparse Matrix with " + str(X_train_count.shape[0]) + " Documents and "+ str(X_train_count.shape[1]) + " Features"

Notice how the number of features has gone up from 25834 in the previous case to 55469. This is mainly because of the n-gram features. We now apply TF-IDF to the word counts.


In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
print '[process started: ' + str(datetime.datetime.now()) + ']'
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_count)
print '[process ended: ' + str(datetime.datetime.now()) + ']'
print X_train_tfidf.shape

In [ ]:
from sklearn.naive_bayes import MultinomialNB
print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
# you fit the NB model
clf = MultinomialNB().fit(X_train_tfidf, training_data.target)
print ' [Classification ended: ' + str(datetime.datetime.now()) + ']'

In [ ]:
from __future__ import division

#load test data
test_data= load_files('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/test/',shuffle=True, encoding='ISO-8859-2')
# variable to track the accuracy
count=0
print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
# Iterate over the test document file by file
for i in range(0,len(test_data.filenames)):
    docs_test = [test_data.data[i]]
    # Apply the count vectorizer we used to fit the training data on the test data
    doc_test_counts = count_vect.transform(docs_test)
    # apply the tfidf transformation
    doc_test_tfidf = tfidf_transformer.transform(doc_test_counts)
    # predict returns one label per input document; take the single entry
    predicted = clf.predict(doc_test_tfidf)[0]
    # Predicted Target label
    predicted_label = training_data.target_names[predicted]
    # True Target label of test document
    true_label=test_data.target_names[test_data.target[i]]
    # calculate the accuracy 
    if predicted_label==true_label:
        count+=1
print ' [Classification Ended: ' + str(datetime.datetime.now()) + ']'
print "ACCURACY : " + str(count/len(test_data.filenames))

From 64 to 67 % Accuracy


Now there isn't much more tweaking you can do to improve the accuracy of Naive Bayes. Though alternate models such as Logistic Regression and Stochastic Gradient Descent may give better results, I want to see if I can improve the accuracy of the existing model using some simple techniques.
Instead of predicting one target label, we can tweak the program to output, say, the top 5 predictions for every test document based on the probability values. This is quite easy to implement on the existing code and only requires us to use the predict_proba function instead of the predict function of MultinomialNB. I'll show you in a second how to do this.


In [ ]:
from __future__ import division
#load test data
test_data= load_files('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/test/',shuffle=True, encoding='ISO-8859-2')
# variable to track the accuracy
count=0
print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
# Let's create a dictionary that stores the top 5 predictions for every test document 
# Hence the key will be the document number and the value will be a list containing top 5 predictions
NBPredictions={}

# Iterate over the test document file by file
for i in range(0,len(test_data.filenames)):
    docs_test = [test_data.data[i]]
    # Apply the count vectorizer we used to fit the training data on the test data
    doc_test_counts = count_vect.transform(docs_test)
    # apply the tfidf transformation
    doc_test_tfidf = tfidf_transformer.transform(doc_test_counts)
    probability = clf.predict_proba(doc_test_tfidf).flatten()
    #print probability
    # sort according to the probability value in descending order
    a= (-probability).argsort()[:5]
    #create a list of predicted labels
    predicted=[]
    for target in a:
        predicted.append(training_data.target_names[target])
    
    true_label = test_data.target_names[test_data.target[i]]
    NBPredictions[i]=predicted
    if true_label in predicted:
        count+=1
    
print ' [Classification Ended: ' + str(datetime.datetime.now()) + ']'
print "ACCURACY : " + str(count/len(test_data.filenames))

WOAH!

We see here that the accuracy of the classifier is now 82.67% for the top 5 predictions. However, in a real text classification system you would need human intervention to choose one target label out of the top 5 predictions, which may be cumbersome. Our idea here is to automate the decision making, so simply printing the top 5 predictions may not be a good idea.
What we will do now is apply a Latent Semantic Indexing (LSI) similarity measure between the test document and the training documents that belong to the classes predicted by the Naive Bayes model (i.e. the top 5 predictions that we made earlier). LSI is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts. It is called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text. The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don’t share a specific word or words with the search criteria. *Source
Fortunately, Gensim comes to our rescue here. Gensim is a Python package by Radim Rehurek which is mainly used for statistical semantic analysis and topic modelling. It provides us with a nice interface to carry out LSI similarity measures on the input corpus.
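Before wiring gensim into our pipeline, here is a self-contained toy example (my addition; the documents are made up purely for illustration) of the gensim workflow we will follow: build a dictionary and bag-of-words corpus, project it into LSI space, and query a similarity index.


In [ ]:
from gensim import corpora, models, similarities

toy_docs = [["oil", "prices", "rise", "sharply"],
            ["crude", "oil", "exports", "fall"],
            ["corn", "harvest", "increases"]]

toy_dictionary = corpora.Dictionary(toy_docs)
toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]

# project the tiny corpus into a 2-topic LSI space and build a similarity index
toy_lsi = models.LsiModel(toy_corpus, id2word=toy_dictionary, num_topics=2)
toy_index = similarities.MatrixSimilarity(toy_lsi[toy_corpus], num_features=2)

# the scores are cosine similarities between the query and each document in LSI space
query_bow = toy_dictionary.doc2bow(["oil", "exports"])
print list(toy_index[toy_lsi[query_bow]])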


In [ ]:
# Print the top 5 predictions made for the first document in the test set by MultinomialNB model
NBPredictions[0]

While doing the LSI similarity lookup, we will load only those categories from the training set that belong to the top 5 predictions, i.e. for Test Document 0, since we know that the top 5 predictions are ['acq', 'earn', 'crude', 'trade', 'interest'], we'll load only training data that belongs to these categories. sklearn.datasets.load_files provides a categories parameter for this: it takes a list of classes and loads training data only from those classes.


In [ ]:
cosine_data = load_files('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/training/',
                        categories=NBPredictions[0])
print "Loaded " + str(len(cosine_data.filenames)) + " Files belonging to " + str(len(cosine_data.target_names)) +" Classes "
print "Classes are " + str(list(NBPredictions[0]))

Now let us put it all together.

Gensim

Note that the corpus we loaded using load_files above resides fully in memory as a plain Python Bunch. When the input corpus becomes very big, with millions of documents, storing all of them in RAM may be a bad design decision and could crash the system. Gensim overcomes this by using a technique called streaming: documents are loaded and processed in memory one at a time. Although the output is the same as for the plain Python Bunch, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. The implementation of a gensim corpus class is shown below:


In [ ]:
# to print the gensim logging in ipython notebook
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore

In [ ]:
import gensim
import os
from os.path import join
from gensim import utils
import datetime, re, sys
from nltk.tokenize import RegexpTokenizer
from gensim import similarities,models, matutils
from gensim.similarities import MatrixSimilarity, SparseMatrixSimilarity, Similarity
from gensim import corpora
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')

# Tokenizer
def text_tokenize(text):        
    tokens = tokenizer.tokenize(text.lower())
    # return the transformed corpus
    out=[]
    for token in tokens:
        # we filter out tokens that exist in stopwords list 
        if token not in stopwords_list:
            out.append(token)
            
    return out  # returns a list

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        for filename in self.input:
            yield text_tokenize(open(filename).read())

To stream the input corpus, we store the absolute paths of the files in a list and pass that list to the MyCorpus constructor; its get_texts method then yields the tokenized documents one by one, which implements the streaming interface.

NOTE

I am going to run it only for the first 50 files in test_data to show you the output. Iterating through the entire test set may take quite some time.


In [ ]:
import sys
accCount=0 # Keep track of the accuracy
# Iterate over the test documents
# to iterate over the entire test_data replace range(0,50) with range(0,len(test_data.filenames))
for i in range(0,50):    
    print "CLASSIFICATION " + str(i)
    # Root path where training data exists
    root_path='/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/training/'
    filename=[]  # list containing absolute path name of all the filenames 
    targetvar=[] # list that contains target label
    
    for j in NBPredictions[i]: # pull up the top5 predictions
        container_path=join(root_path,j)  # container path is root_path/class_name
        for root, dirs, files in os.walk(container_path, topdown=True):
            for name in files:
                targetvar.append(root[root.rfind('/')+1:])
                filename.append(join(root,name))                
                     
    # serialize the corpus and store it as a .mm file
    
    mycorpus= MyCorpus(filename)
    corpora.MmCorpus.serialize('/home/arvindramesh/Desktop/Internship/Experimental/corpus.mm', mycorpus)
    corpus= corpora.MmCorpus('/home/arvindramesh/Desktop/Internship/Experimental/corpus.mm')
    dictionary = mycorpus.dictionary
    
    # apply tfidf to the corpus
    tfidf = models.TfidfModel(corpus,id2word=dictionary)
    # Apply LSI to tfidf of corpus
    lsi = models.LsiModel(tfidf[corpus],id2word=dictionary)
    # create a index for fast lookup
    index = similarities.SparseMatrixSimilarity(lsi[tfidf[corpus]],corpus.num_terms)
    
    
    # Testing
    # use the same tokenizer + stopword filtering as the training corpus so that
    # the (lowercased) tokens match the dictionary entries
    tokens = text_tokenize(test_data.data[i])
    vec_bow = dictionary.doc2bow(tokens)  # bag-of-words representation of the test document
    sims = index[lsi[tfidf[vec_bow]]] # compute the lsi similarity measure
    a= (sorted(enumerate(sims),key=lambda item: -item[1])) # sort the similarity measure
    # a now contains a list of tuples that contains the (document id,similarity score)
    
    predicted = targetvar[a[0][0]]
    true = test_data.target_names[test_data.target[i]]
    print "Predicted : " + predicted
    print " True: " + true
    sys.stdout.flush()
    if true==predicted:
        accCount+=1
    
print "ACCURACY : " + str(accCount/50)

Observations made from LSI

LSI is a very interesting technique to use for classifying text. LSI has not given us better performance than MultinomialNB, but it has helped establish some interesting similarity relationships between classes.
For a test document that belongs to the meal-feed class, the LSI model predicts corn. On closer inspection we find that the predicted class makes intuitive sense, and the prediction is not completely wrong, as corn is in fact a primary meal/feed source in some countries such as China, Brazil and Mexico. Similarly, bop (balance of payments) is very close to trade, and the same goes for crude and nat-gas (natural gas). Since these classes are semantically very close to each other, any misclassifications arising out of such a scenario would not be very expensive. Though the accuracy of the LSI model is 72%, its predictions are very intuitive and can make very good business sense when deployed in real-time systems.

Wait but why go through all this trouble?

Well, I hear you! There are some interesting models such as Logistic Regression and Stochastic Gradient Descent at your disposal. These methods are very popular in the literature. However, I wanted to stick with Naive Bayes as it is very simple and elegant.
In my pursuit to make the NB model work, I explored some very interesting NLP concepts such as stemmers and lemmatizers, collocations, document summarizers (which I will try to cover in a separate tutorial soon) and the Stanford NER (Named Entity Recognizer). NER helps identify what's "important" in a text document. Also called entity extraction, this process involves automatically extracting the names of persons, places, organizations, and potentially other entity types out of unstructured text. Building an NER classifier requires lots of annotated training data and some fancy machine learning algorithms. These helped me build a very cool news article summarizer that I am currently working on. The summarizer was made with the intention of summarizing large chunks of text in order to reduce the dimensionality of datasets for the classifier models. I hope to add other capabilities and make a nifty text summarizer to summarize those boring lecture notes ;)
I also experimented with the gensim package, which helped me try out the LSI model on the input with promising results. Deep learning techniques such as Google's Word2Vec are also interesting alternatives to the bag-of-words representation that you can use with the existing model.

Beyond Naive Bayes- Logistic Regression and Stochastic Gradient Descent

Logistic regression is one of my personal favorite algorithms when it comes to multi-class classification. The model is simple and yet gives amazing results out of the box. The best starting point would be this Wikipedia entry. Fortunately for us, we have a Logistic Regression implementation in scikit-learn, which we will code up in just a minute. You can check out this page on scikit-learn that talks about using Stochastic Gradient Descent as a classifier. So let's try both of them out!


In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

print '[process started: ' + str(datetime.datetime.now()) + ']'
count_vect= CountVectorizer(analyzer = "word", stop_words= set(stopwords_list), tokenizer=text_tokenize, min_df=3,
                           max_df=0.5,ngram_range=(1,3),lowercase=True)
X_train_count = count_vect.fit_transform(training_data.data)
print '[process ended: ' + str(datetime.datetime.now()) + ']'
print "Transformed " + str(X_train_count.shape[0]) + " documents  with " + str(X_train_count.shape[1]) + " Features"

In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
print '[process started: ' + str(datetime.datetime.now()) + ']'
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_count)
print '[process ended: ' + str(datetime.datetime.now()) + ']'
print X_train_tfidf.shape

Fit SGDClassifier and LogisticRegression


In [ ]:
print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
# Fit the logistic regression and SGD 
# For the logistic regression we will use the one vs rest classifier
clf1 = LogisticRegression(class_weight='auto',solver='newton-cg',multi_class='ovr').fit(X_train_tfidf, training_data.target)
clf2 = SGDClassifier(loss='log',class_weight='auto').fit(X_train_tfidf,training_data.target)
print ' [Classification ended: ' + str(datetime.datetime.now()) + ']'

Test the classifier


In [ ]:
from __future__ import division

#load test data
test_data= load_files('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/test/',shuffle=True, encoding='ISO-8859-2')
# variable to track the accuracy
Logisticount=0
SGDcount=0
print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
# Iterate over the test document file by file
for i in range(0,len(test_data.filenames)):
    docs_test = [test_data.data[i]]
    # Apply the count vectorizer we used to fit the training data on the test data
    doc_test_counts = count_vect.transform(docs_test)
    # apply the tfidf transformation
    doc_test_tfidf = tfidf_transformer.transform(doc_test_counts)
    # predict returns one label per input document; take the single entry
    predicted1 = clf1.predict(doc_test_tfidf)[0]
    predicted2 = clf2.predict(doc_test_tfidf)[0]
    # Predicted Target label
    predicted_label1 = training_data.target_names[predicted1]
    predicted_label2 = training_data.target_names[predicted2]
    # True Target label of test document
    true_label=test_data.target_names[test_data.target[i]]
    # calculate the accuracy 
    if predicted_label1==true_label:
        Logisticount+=1
    if predicted_label2==true_label:
        SGDcount+=1
print ' [Classification Ended: ' + str(datetime.datetime.now()) + ']'
print "ACCURACY of LogisticRegression : " + str(Logisticount/len(test_data.filenames))
print "ACCURACY of SGDClassifier : " + str(SGDcount/len(test_data.filenames))

Wohoooo!! We have a winner!!

Other metrics for classifier performance

Accuracy is a good measure, but there are a few other metrics, such as the F1 score, that help us understand the performance of a classifier better.
The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal.
Precision measures whether the predictions made are correct, while recall measures whether everything that should be predicted is predicted. Precision and recall generally trade off against each other, and the F1 score balances the two. Scikit-learn has an implementation to compute the F1 score, where the formula is
F1 = 2*(precision * recall) / (precision + recall)
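As a tiny sanity check (an extra cell I've added, with made-up labels), the snippet below computes precision, recall and F1 with scikit-learn and verifies that the F1 value matches the formula above.


In [ ]:
from __future__ import division
from sklearn.metrics import precision_score, recall_score, f1_score

y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

p = precision_score(y_true_toy, y_pred_toy)  # 2 true positives out of 3 predicted positives
r = recall_score(y_true_toy, y_pred_toy)     # 2 true positives out of 3 actual positives
print p, r
print f1_score(y_true_toy, y_pred_toy)       # harmonic mean of precision and recall
print 2 * (p * r) / (p + r)                  # same value, from the formula above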


In [ ]:
from __future__ import division
#load test data
test_data= load_files('/media/arvindramesh/CCF02769F02758CA/TextClassification-NewsGroup/test/',shuffle=True, encoding='ISO-8859-2')
# variable to track the accuracy
Logisticount=0
SGDcount=0
truelabel=[]# list to hold true label
Logistic=[] # list to hold predicted label of LogisticRegression Classifier
Sgd=[]# list to hold predicted label of SGDClassifier

print ' [Classification Started: ' + str(datetime.datetime.now()) + ']'
# Iterate over the test document file by file
for i in range(0,len(test_data.filenames)):
    docs_test = [test_data.data[i]]
    # Apply the count vectorizer we used to fit the training data on the test data
    doc_test_counts = count_vect.transform(docs_test)
    # apply the tfidf transformation
    doc_test_tfidf = tfidf_transformer.transform(doc_test_counts)
    # predict returns one label per input document; take the single entry
    predicted1 = clf1.predict(doc_test_tfidf)[0]
    predicted2 = clf2.predict(doc_test_tfidf)[0]
    # Predicted Target label
    predicted_label1 = training_data.target_names[predicted1]
    Logistic.append(predicted_label1)
    predicted_label2 = training_data.target_names[predicted2]
    Sgd.append(predicted_label2)
    # True Target label of test document
    true_label=test_data.target_names[test_data.target[i]]
    truelabel.append(true_label)
    # calculate the accuracy 
    if predicted_label1==true_label:
        Logisticount+=1
    if predicted_label2==true_label:
        SGDcount+=1
print ' [Classification Ended: ' + str(datetime.datetime.now()) + ']'

In [ ]:
# create a dictionary that maps each class label to an integer index
# this will help us while calculating the f1 scores
dictmap={}
for i in range(0,len(training_data.target_names)):
    dictmap[str(training_data.target_names[i])]=i

In [ ]:
# Create target vectors for the true and predicted labels; fall back to an
# arbitrary index (84) if a test label is not found in the training label map
y_true = [dictmap[str(x)] if dictmap.has_key(x) else 84 for x in truelabel]
y_pred_logistic = [dictmap[str(x)] for x in Logistic]
y_pred_SGD = [dictmap[str(x)] for x in Sgd]
print "Created Predicted Target labels"

Classification Report for LogisticRegression


In [ ]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred_logistic, target_names=training_data.target_names))

Classification Report for SGDClassifier


In [ ]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred_SGD, target_names=training_data.target_names))

What Next?

Hopefully, in the next set of tutorials I'll cover some web scraping along with a few NLP concepts such as stemmers, lemmatizers and Named Entity Recognizers, which will help us make better feature selections. We'll also work towards creating an ensemble classifier wherein we will combine all three models, viz. Multinomial Naive Bayes, LogisticRegression and SGDClassifier, to create a more accurate model for text classification. I would love to hear from you. Shoot me an email at aravindk@buffalo.edu for any suggestions, changes, comments and feedback.

